The objective of this notebook is to build an automated human activity recognition system. The main goal is to obtain the highest cross-validated activity-prediction performance by applying various data preprocessing and machine learning methods and tuning their parameters.
The labeled human activity data used in this study is publicly available on Kaggle [1].
Throughout this workbook, I will follow an iterative process, going back and forth between various data visualization, data preprocessing and model-training methods while paying special attention to both classification performance and training/prediction time.
My goal is, eventually, to learn more about the nature of the activity recognition problem. I will mostly take an application developer's view when discussing the real-life implications of the obtained results.
[1] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013 [ https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones ]
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
class_labels = ['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS', 'SITTING', 'STANDING', 'LAYING']
X_train = pd.read_csv('train.csv')
s_train = X_train['subject']
X_train.drop('subject', axis = 1, inplace = True)
y_train = X_train['Activity'].to_frame().reset_index(drop=True) # drop=True avoids adding a spurious 'index' column
X_train.drop('Activity', axis = 1, inplace = True)
y_train = y_train.replace(class_labels, [0, 1, 2, 3, 4, 5])
X_test = pd.read_csv('test.csv')
s_test = X_test['subject']
X_test.drop('subject', axis = 1, inplace = True)
y_test = X_test['Activity'].to_frame().reset_index(drop=True) # drop=True avoids adding a spurious 'index' column
X_test.drop(['Activity'], axis = 1, inplace = True)
y_test = y_test.replace(class_labels, [0, 1, 2, 3, 4, 5])
#NOTE: append with ignore_index=True renumbers the rows of the combined frame,
#so the result has a unique 0..n-1 index. A plain concat without ignore_index
#would keep both frames' original indices and produce duplicate index values.
X = X_train.append(X_test, ignore_index=True)
y = y_train.append(y_test, ignore_index=True)
display(X.describe())
# display(y.describe())
#Equivalent with concat:
# X = pd.concat([X_train, X_test], ignore_index=True)
# y = pd.concat([y_train, y_test], ignore_index=True)
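As a quick illustration on toy frames (not the HAR data), the index behavior that motivates `ignore_index=True`:

```python
import pandas as pd

a = pd.DataFrame({"f": [1.0, 2.0]})
b = pd.DataFrame({"f": [3.0, 4.0]})

# Without ignore_index the original row labels survive: 0, 1, 0, 1 (duplicates).
dup = pd.concat([a, b])
# With ignore_index=True the result is renumbered 0..3, duplicate-free.
clean = pd.concat([a, b], ignore_index=True)
print(dup.index.tolist(), clean.index.tolist())
```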
Before doing more detailed work on the features and on model training and testing, I will apply several supervised machine learning methods to get an idea of their baseline performance.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
from time import time
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
def train(clf, features, target):
start = time()
clf.fit(features, target)
end = time()
return end - start
def predict(clf, features):
start = time()
pred = clf.predict(features)
end = time()
return end - start, pred
clf_SGD = SGDClassifier(random_state = 42)
clf_Ada = AdaBoostClassifier(random_state = 42)
clf_DTR = DecisionTreeRegressor(random_state=42)
clf_KNC = KNeighborsClassifier()
clf_GNB = GaussianNB()
clf_SVM = SVC()
clfs = [clf_SGD, clf_Ada, clf_DTR, clf_KNC, clf_GNB, clf_SVM] # a list keeps the iteration order deterministic (a set does not)
y_train_ = y_train['Activity']
y_test_ = y_test['Activity']
y_ = y['Activity']
for clf in clfs:
printout = ""
if clf == clf_SGD: printout = "SGD"
elif clf == clf_Ada: printout = "Ada"
elif clf == clf_DTR: printout = "DTR"
elif clf == clf_KNC: printout = "KNC"
elif clf == clf_GNB: printout = "GNB"
elif clf == clf_SVM: printout = "SVM"
results_precision = []
results_recall = []
results_fscore = []
results_ttrain = []
results_ttest = []
kfold = cross_validation.KFold(X.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
start = time()
clf.fit(X.iloc[train], y_[train])
results_ttrain.append(time()-start)
#NOTE: the helper train() defined above cannot be called here, because the
#loop variable `train` (the fold's row indices) shadows the function name.
# t_train = train(clf, X.iloc[train], y_[train])
t_test, y_pred = predict(clf, X.iloc[test])
results_ttest.append(t_test)
precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
printout += " t_train: {:.4f}sec".format(np.mean(results_ttrain))
printout += " t_pred: {:.4f}sec".format(np.mean(results_ttest))
print printout
SVM and SGD have the highest precision, recall and f1-score. SGD is the quickest in prediction and the second quickest in training. I will use SVM as the main method for this activity recognition system: it matches or exceeds SGD on all three classification metrics, and its extra training and prediction time is still acceptable at this dataset size.
Although the cross-validated prediction performance is already high, there may still be room for improvement. For instance, removing outliers is one way to improve the model. We cannot visualize a 561-dimensional space in a human-readable form, but we can still look at how the features are distributed individually. I will plot the distributions of some of the features below.
Moreover, some of the features might be redundant. Redundant features can be identified by examining their correlation with the other features: if a feature is highly correlated with others, there is no reason to keep it, since the information it carries is already conveyed by those other features.
A scatter matrix therefore lets us see both the distribution of each feature individually and the correlations between features, as shown below. I will use the SelectKBest method to choose a subset of features for further investigation. To decide on the number K, I will run an exhaustive training batch where I vary K and monitor the change in the cross-validated prediction performance of the SVM model.
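As a sketch of that redundancy check on toy data (the 0.95 threshold and the `drop_redundant` helper are illustrative, not values used elsewhere in this notebook):

```python
import pandas as pd
import numpy as np

def drop_redundant(df, threshold=0.95):
    # Absolute pairwise correlations; upper triangle only, so each pair is seen once.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is redundant if it correlates above the threshold with an earlier one.
    redundant = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=redundant), redundant

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.0, 6.0, 8.0],   # exact multiple of "a", corr = 1.0
                   "c": [1.0, -1.0, 2.0, 0.0]}) # uncorrelated with "a"
reduced, dropped = drop_redundant(df)
print(dropped)
```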
from sklearn.feature_selection import SelectKBest
import matplotlib.pyplot as plt
%matplotlib inline
d_kbest_to_precision = {}
d_kbest_to_recall = {}
d_kbest_to_f1score = {}
kbest_max = X.shape[1]/5
clf = clf_SVM
for kbest in range (2, kbest_max):
f_selector = SelectKBest(k=kbest)
Xs = f_selector.fit(X, y_).transform(X)
printout = "kbest: {:3d}".format(kbest)
results_precision = []
results_recall = []
results_fscore = []
results_ttrain = []
results_ttest = []
kfold = cross_validation.KFold(Xs.shape[0], n_folds=4, shuffle=False, random_state=42)
for train, test in kfold:
start = time()
clf.fit(Xs[train], y_[train])
results_ttrain.append(time()-start)
#NOTE: the helper train() defined above cannot be called here, because the
#loop variable `train` (the fold's row indices) shadows the function name.
# t_train = train(clf, X.iloc[train], y_[train])
t_test, y_pred = predict(clf, Xs[test])
results_ttest.append(t_test)
precision, recall, fscore, support = precision_recall_fscore_support(y_[test], y_pred, average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
printout += " t_train: {:.3f}sec".format(np.mean(results_ttrain))
printout += " t_pred: {:.3f}sec".format(np.mean(results_ttest))
print printout
d_kbest_to_precision[kbest]=np.mean(results_precision)
d_kbest_to_recall[kbest]=np.mean(results_recall)
d_kbest_to_f1score[kbest]=np.mean(results_fscore)
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.grid(True)
major_ticks = np.arange(0, kbest_max, 20)
minor_ticks = np.arange(0, kbest_max, 5)
# ax.set_xticks(major_ticks)
# ax.set_xticks(minor_ticks, minor=True)
plt.xticks(minor_ticks)
ks = sorted(d_kbest_to_precision) # sort the keys so the lines are drawn left to right
plt.plot(ks, [d_kbest_to_precision[k] for k in ks], 'r', label='precision')
plt.plot(ks, [d_kbest_to_recall[k] for k in ks], 'g', label='recall')
plt.plot(ks, [d_kbest_to_f1score[k] for k in ks], 'b', label='f1-score')
plt.legend(loc='lower right')
plt.show()
Precision, recall and f-score values are computed as class-weighted averages, since the class labels are imbalanced in the dataset:
496 WALKING,
471 WALKING_UPSTAIRS,
420 WALKING_DOWNSTAIRS,
491 SITTING,
532 STANDING,
537 LAYING
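A toy two-class example (not the HAR data) of why the averaging mode matters under class imbalance:

```python
from sklearn.metrics import precision_recall_fscore_support

# 90 samples of class 0, 10 of class 1; the classifier misses half of class 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 5 + [1] * 5

p_w, r_w, f_w, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
p_m, r_m, f_m, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
# Weighted recall (0.95) is pulled up by the majority class; macro recall is 0.75.
print(r_w, r_m)
```

Weighted averaging reflects the overall per-sample accuracy of the classifier, while macro averaging exposes poor performance on minority classes.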
I will take the best 16 features for further investigation. This is where the classification scores peak for the first time, and they do not change much beyond that point.
kbest_selected = 16
f_selector = SelectKBest(k=kbest_selected)
f_selector.fit(X, y['Activity'])
f_selected_indices = f_selector.get_support(indices=False)
Xs_cols = X.columns[f_selected_indices]
Xs = X[Xs_cols] # dataset with selected features
# display(Xs.describe())
Having normally distributed features is a fundamental assumption of many predictive models. A normal distribution is unskewed: a value is equally likely to fall on either side of the mean. As the scatter matrix and the skewness test below show, these features are quite skewed, and many are even bimodal.
import scipy.stats.stats as st
import operator
skness = st.skew(X)
d_feature2skew = {}
for skew, feature in zip(skness , X.columns.values.tolist()):
d_feature2skew[feature]=skew
feature2skew = sorted(d_feature2skew.items(), key=operator.itemgetter(1), reverse=True)
# for key, value in feature2skew:
# print str(value) + " " + str(key)
# Produce a scatter matrix for each pair of features in the data
axes = pd.scatter_matrix(Xs, alpha = 0.3, figsize = (20,32), diagonal = 'kde')
# Reformat data.corr() for plotting
corr = Xs.corr().as_matrix()
# Plot scatter matrix with correlations
for i,j in zip(*np.triu_indices_from(axes, k=1)):
axes[i,j].annotate("%.2f"%corr[i,j], (0.1,0.25), xycoords='axes fraction', color='red', fontsize=16)
A skewness greater than zero indicates a positively skewed distribution, and one below zero a negatively skewed distribution. Replacing the data with its log, square root, or inverse may help remove the skew. However, the feature values of the current selected-feature dataset range between -1 and 1, so sqrt and log are not directly applicable: applied as-is, they would turn most of the feature values into NaN and make the dataset useless.
To avoid this, we first shift the data into a positive range, then apply the non-linear transformation, and finally scale the result back to [-1, 1] so the change in each feature's distribution can be compared by eye. If all goes well, we should see less skewed feature distributions.
In addition to sqrt and log, I will also try the Box-Cox transformation to reduce the skewness.
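The shift-transform-rescale recipe can be sketched on synthetic data (a right-skewed beta sample standing in for one of the [-1, 1]-ranged features; the [1, 2] shift range matches the one used below):

```python
import numpy as np
from scipy.stats import skew, boxcox

rng = np.random.RandomState(42)
# A right-skewed synthetic feature whose values lie in [-1, 1],
# mimicking the selected features of this dataset.
x = rng.beta(2, 8, size=5000) * 2 - 1

# 1) shift into the strictly positive range [1, 2] (log/sqrt/Box-Cox need > 0)
shifted = 1.0 + (x - x.min()) / (x.max() - x.min())
# 2) apply the non-linear transformation; Box-Cox picks its exponent by MLE
transformed = boxcox(shifted)[0]
# 3) scale back to [-1, 1] so the histograms stay visually comparable
rescaled = -1.0 + 2.0 * (transformed - transformed.min()) / (transformed.max() - transformed.min())

# Box-Cox typically pushes the skewness toward zero; compare before/after.
print("skew before: {:.2f}, after: {:.2f}".format(skew(x), skew(rescaled)))
```

The final min-max rescale is affine, so it changes neither the skewness nor the shape of the distribution; only step 2 does.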
SVM's classification performance for 16, 56, and 561 features are as follows:
n_features: 16 t_train: 0.802sec t_pred: 0.620sec precision: 0.84 recall: 0.81 fscore: 0.79
n_features: 56 t_train: 1.421sec t_pred: 1.247sec precision: 0.88 recall: 0.83 fscore: 0.81
n_features: 561 t_train: 8.916sec t_pred: 7.667sec precision: 0.94 recall: 0.94 fscore: 0.94
Using 16 features reduced the training and testing times by more than a factor of ten, at the cost of about 10% of classification performance as measured by precision, recall and f-score. Compared to the 56 best features (the top 10% of the whole feature vector), the 16-feature set is almost as good in classification performance while being twice as fast in training and testing.
As 16 features are good enough for SVM, I will now try to improve the classification performance through scaling, normalization and outlier removal. First, let's look at each feature's skewness and at histograms of the raw and transformed feature values.
import scipy.stats.stats as st
skness = st.skew(Xs)
for skew, feature_name in zip(skness , Xs_cols.tolist()):
print "skewness: {:+.2f}\t\t feature: ".format(skew) + feature_name
from sklearn import preprocessing
from scipy.stats import boxcox
plt.rcParams['figure.figsize'] = (20.0, 80.0)
f, axarr = plt.subplots(len(Xs_cols.tolist()), 4, sharey=True)
preprocessing_names = ["noproc", "sqrted", "logged", "bxcxed"]
cnt = 0
for feature in Xs_cols.tolist():
for i in range(4):
axarr[cnt, i].set_title("[" + preprocessing_names[i] + "] " + feature)
axarr[cnt, i].set_xlabel(feature)
axarr[cnt, i].set_ylabel("number of data points")
Xs_feature = Xs[feature]
skness = st.skew(Xs_feature)
axarr[cnt, 0].hist(Xs_feature,facecolor='blue',alpha=0.75)
axarr[cnt, 0].text(0.05, 0.95, 'Skewness[noproc]: {:.2f}'.format(skness), transform=axarr[cnt, 0].transAxes,
fontsize=12, verticalalignment='top', color='red')
Xs_feature_scaled = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_feature)
Xs_feature_sqrted = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.sqrt(Xs_feature_scaled))
# Xs_feature_sqrted = preprocessing.scale(np.sqrt(Xs_feature_scaled))
skness = st.skew(Xs_feature_sqrted)
axarr[cnt, 1].hist(Xs_feature_sqrted,facecolor='blue',alpha=0.75)
axarr[cnt, 1].text(0.05, 0.95, 'Skewness[sqrted]: {:.2f}'.format(skness), transform=axarr[cnt, 1].transAxes,
fontsize=12, verticalalignment='top', color='green')
Xs_feature_logged = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(np.log(Xs_feature_scaled))
# Xs_feature_logged = preprocessing.scale(np.log(Xs_feature_scaled))
skness = st.skew(Xs_feature_logged)
axarr[cnt, 2].hist(Xs_feature_logged,facecolor='blue',alpha=0.75)
axarr[cnt, 2].text(0.05, 0.95, 'Skewness[logged]: {:.2f}'.format(skness), transform=axarr[cnt, 2].transAxes,
fontsize=12, verticalalignment='top', color='green')
Xs_feature_bxcxed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(boxcox(Xs_feature_scaled)[0])
# Xs_feature_bxcxed = preprocessing.scale(boxcox(Xs_feature_scaled)[0])
skness = st.skew(Xs_feature_bxcxed)
axarr[cnt, 3].hist(Xs_feature_bxcxed,facecolor='blue',alpha=0.75)
axarr[cnt, 3].text(0.05, 0.95, 'Skewness[bxcxed]: {:.2f}'.format(skness), transform=axarr[cnt, 3].transAxes, fontsize=12,
verticalalignment='top', color='green', bbox=dict(facecolor='white', alpha=0.5, boxstyle='square'))
cnt += 1
plt.show()
#NOTE: Tried robust scaler but it didn't have any effect on the dataset's skewness
Xs_rscaled = preprocessing.RobustScaler().fit_transform(Xs)
print Xs_rscaled.shape
for feature in range(Xs_rscaled.shape[1]):
Xs_rscaled_feature = Xs_rscaled[:,feature]
skness = st.skew(Xs_rscaled_feature)
print "{:2d}".format(feature) + " {:+.2f}".format(skness)
def boxCoxData(data):
data_bxcxed = []
for feature in range(data.shape[1]):
data_bxcxed_feature, maxlog = boxcox(data[:,feature])
if feature == 0:
data_bxcxed = data_bxcxed_feature
else:
data_bxcxed = np.column_stack([data_bxcxed, data_bxcxed_feature])
return data_bxcxed
def ScaleData(data):
data_scaled = []
for feature in range(data.shape[1]):
data_scaled_feature = preprocessing.scale(data[:,feature])
if feature == 0:
data_scaled = data_scaled_feature
else:
data_scaled = np.column_stack([data_scaled, data_scaled_feature])
return data_scaled
def testSVMPerformance(data_train, label_train, data_test, label_test, preprocess_method):
if preprocess_method != "":
data_train = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_train)
data_test = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(data_test)
if preprocess_method == "logged":
data_train = np.log(data_train)
data_test = np.log(data_test)
elif preprocess_method == "sqrted":
data_train = np.sqrt(data_train)
data_test = np.sqrt(data_test)
elif preprocess_method == "bxcxed":
data_train = boxCoxData(data_train)
data_test = boxCoxData(data_test)
#this resulted in inferior performance compared to the preprocessing.scale method
# data_train = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_train)
# data_test = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(data_test)
data_train = ScaleData(data_train)
data_test = ScaleData(data_test)
start = time()
clf_SVM.fit(data_train, label_train)
end = time()
t_train = end - start
#NOTE: the helper train() cannot be used here either: the module-level
#`for train, test in kfold:` loops above rebound the global name `train`
#to an index array, so it no longer refers to the function.
# t_train = train(clf_SVM, data_train, label_train)
t_test, y_pred = predict(clf_SVM, data_test)
precision, recall, fscore, support = precision_recall_fscore_support(label_test, y_pred, average='weighted')
printout = preprocess_method
if preprocess_method == "":
printout = "noproc"
printout += " t_train: {:.3f}sec".format(t_train)
printout += " t_pred: {:.3f}sec".format(t_test)
printout += " precision: {:.2f}".format(precision)
printout += " recall: {:.2f}".format(recall)
printout += " fscore: {:.2f}".format(fscore)
print printout
X_train_processed = X_train[Xs_cols]
X_test_processed = X_test[Xs_cols]
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "scaled")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "logged")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "sqrted")
testSVMPerformance(X_train_processed, y_train['Activity'], X_test_processed, y_test['Activity'], "bxcxed")
It is time to test if there is any outlier in the boxcoxed dataset.
Xs_processed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs)
Xs_bxcxed = boxCoxData(Xs_processed)
Xs_bxcxed_scaled = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(Xs_bxcxed)
outliers = []
for feature in range(Xs_bxcxed_scaled.shape[1]):
Q1 = np.percentile(Xs_bxcxed_scaled[:, feature], 25)
Q3 = np.percentile(Xs_bxcxed_scaled[:, feature], 75)
step = 1.5 * (Q3 - Q1)
outlier_filter = ~((Xs_bxcxed_scaled[:, feature] >= Q1 - step) & (Xs_bxcxed_scaled[:, feature] <= Q3 + step))
cnt = 0
for outlier in outlier_filter:
if outlier:
outliers.append(cnt)
cnt += 1
# print "number of outliers with repeating indices: " + str(len(outliers))
id2cnt = {}
for outlier in outliers:
if not outlier in id2cnt:
id2cnt[outlier] = 1
else:
id2cnt[outlier] += 1
sorted_id2cnt = sorted(id2cnt.items(), key=operator.itemgetter(1), reverse=True)
cnt2nindices = {}
for key, value in sorted_id2cnt:
#only remove the outliers that are repeated more than once
if value <=1:
break
if not value in cnt2nindices:
cnt2nindices[value] = 1
else:
cnt2nindices[value] += 1
for key, value in cnt2nindices.iteritems():
print "{:2d} features share {:4d} potential outliers".format(key, value)
Let's remove those 1953 potential outliers and test the performance of the SVM again. Although this seems like losing too much data, I just want to see how it may affect the learning performance.
removed_outliers = []
for key, value in sorted_id2cnt:
#remove only the samples that were flagged as outliers in exactly 3 features
if value == 3:
removed_outliers.append(key)
y_labels = y['Activity']
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(Xs.iloc[train], y_labels.iloc[train])
# t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
t_test, y_pred = predict(clf_SVM, Xs.iloc[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout = "subsetsize: {:5d}".format(Xs.shape[0])
# printout += " t_train: {:.3f}sec".format(t_train)
# printout += " t_pred: {:.3f}sec".format(t_test)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
Xs_filtered = Xs.drop(removed_outliers)
y_filtered = y.drop(removed_outliers)
y_filtered_labels = y_filtered['Activity'].to_frame()
Xs_filtered_proc = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs_filtered)
Xs_filtered_proc = boxCoxData(Xs_filtered_proc)
Xs_filtered_proc = ScaleData(Xs_filtered_proc)
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered_proc.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(Xs_filtered_proc[train], y_filtered_labels.iloc[train])
t_test, y_pred = predict(clf_SVM, Xs_filtered_proc[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
print "**************"
printout = "subsetsize: {:5d}".format(Xs_filtered_proc.shape[0])
# printout += " t_train: {:.3f}sec".format(t_train)
# printout += " t_pred: {:.3f}sec".format(t_test)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_filtered.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(Xs_filtered.iloc[train], y_filtered_labels.iloc[train])
# t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
t_test, y_pred = predict(clf_SVM, Xs_filtered.iloc[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_filtered_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout = "subsetsize: {:5d}".format(Xs_filtered.shape[0])
# printout += " t_train: {:.3f}sec".format(t_train)
# printout += " t_pred: {:.3f}sec".format(t_test)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
from random import sample
n_multiplier = Xs.shape[0]/500
for i in range(1, n_multiplier+1):
subsetsize = i*500
random_index = sample(range(0, Xs.shape[0]), subsetsize)
Xs_subset = Xs.iloc[random_index]
y_subset = y_labels.iloc[random_index].to_frame()
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(Xs_subset.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(Xs_subset.iloc[train], y_subset.iloc[train])
# t_train = train(clf_SVM, Xs_subset.iloc[train], y_subset.iloc[train])
t_test, y_pred = predict(clf_SVM, Xs_subset.iloc[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_subset.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout = "subsetsize: {:5d}".format(subsetsize)
# printout += " t_train: {:.3f}sec".format(t_train)
# printout += " t_pred: {:.3f}sec".format(t_test)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
#TODO: TRY PCA
#TODO: merge featuers and re-run
#TODO: I guess that's it!
This shows that feature preprocessing and outlier removal are tied together: the detected outliers are specific to the space the features were transformed into, so outliers in the transformed space may not be outliers in the original space. The following results show that removing the outliers only helps if learning is also done in the space the features were transformed into.
subsetsize: 8346 precision: 0.87 recall: 0.86 fscore: 0.86 (features are preprocessed)
subsetsize: 8346 precision: 0.80 recall: 0.76 fscore: 0.74 (features are kept as the way they are)
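A tiny synthetic illustration of that point (toy data; Tukey's 1.5 IQR fence, as in the outlier scan above): the set of points flagged as outliers can change once the data is transformed.

```python
import numpy as np

def tukey_outliers(v):
    # Indices falling outside the 1.5 * IQR fence.
    q1, q3 = np.percentile(v, 25), np.percentile(v, 75)
    step = 1.5 * (q3 - q1)
    return set(np.where((v < q1 - step) | (v > q3 + step))[0])

# Data spanning several orders of magnitude: the largest value dominates
# the raw scale but blends in once the scale is logarithmic.
v = np.array([1., 2., 4., 8., 16., 32., 64., 128., 256., 10000.])
raw = tukey_outliers(v)
logged = tukey_outliers(np.log(v))
print(raw, logged)   # the raw fence flags the last point; the log fence flags none
```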
from sklearn.decomposition import PCA
# for n_components in range (2, 20):
for n_components in range (2, 400, 4):
pca = PCA(n_components=n_components).fit(X)
# print pca.explained_variance_ratio_
printout = "n_components: {:d}".format(n_components)
X_pcaed = pca.transform(X)
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(X_pcaed.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(X_pcaed[train], y_labels.iloc[train])
t_test, y_pred = predict(clf_SVM, X_pcaed[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
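An alternative, cheaper way to pick the number of components (a sketch on synthetic data; this notebook instead scans n_components directly against classification scores) is the cumulative explained-variance ratio:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# Synthetic data: 3 strong latent directions embedded in 10 noisy features.
latent = rng.randn(500, 3)
data = latent.dot(rng.randn(3, 10)) + 0.01 * rng.randn(500, 10)

pca = PCA().fit(data)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components explaining at least 99% of the variance.
n_keep = int(np.searchsorted(cumvar, 0.99) + 1)
print(n_keep)
```

This selects components by reconstruction quality rather than by classification performance, so the two criteria need not agree.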
Let's combine the k best features with the first 19 PCA components and train the SVM again.
#create a dataframe for the first 19 components of the PCAed dataset
n_components = 19
pca = PCA(n_components=n_components).fit(X)
X_pcaed = pca.transform(X)
#build the column names from n_components; deriving them from the previous
#X_pcaed's width would produce more names than columns and break the constructor
column_names = []
for i in range(n_components):
column_names.append("component{:2d}".format(i))
X_pcaed = pd.DataFrame(data=X_pcaed, index=range(X_pcaed.shape[0]), columns=column_names)
# print X_pcaed
#transform the features of X
Xs_scaled = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(Xs)
Xs_bxcxed = boxCoxData(Xs_scaled)
Xs_bxcxed = ScaleData(Xs_bxcxed)
# print Xs_bxcxed.shape
Xs_bxcxed = pd.DataFrame(data=Xs_bxcxed, index=range(Xs_bxcxed.shape[0]), columns=Xs_cols)
# print X_pcaed
# print Xs_bxcxed
# X_combined = Xs_bxcxed.add(X_pcaed, axis='columns')
# X_combined = Xs_bxcxed + X_pcaed
X_combined = pd.concat([Xs_bxcxed, X_pcaed], axis=1)
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(X_combined.shape[0], n_folds=10, shuffle=False, random_state=42)
for train, test in kfold:
clf_SVM.fit(X_combined.iloc[train], y_labels.iloc[train])
t_test, y_pred = predict(clf_SVM, X_combined.iloc[test])
precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout = "combined"
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}".format(np.mean(results_fscore))
print printout
from sklearn.feature_selection import SelectKBest
from scipy.stats import boxcox
from sklearn import preprocessing
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
# kbest_param_vals = [5, 10, 15, 20, 30, 50, 100, 200, Xs.shape[1]]
kbest_param_vals = [X.shape[1]]
pca_n_components = [2, 5, 10, 15, 20, 30, 40, 50, 100, 200, 400]
for kbest in kbest_param_vals:
start = time()
#choose kbest feature dimensions
f_selector = SelectKBest(k=kbest)
X_slctd = f_selector.fit(X, y['Activity']).transform(X)
f_selected_indices = f_selector.get_support(indices=False)
X_slctd_cols = X.columns[f_selected_indices]
#transform these features to another space where they are less skewed
X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(1, 2), copy=True).fit_transform(X_slctd)
X_slctd_tformed = boxCoxData(X_slctd_tformed)
X_slctd_tformed = preprocessing.MinMaxScaler(feature_range=(-1, 1), copy=True).fit_transform(X_slctd_tformed)
X_slctd_tformed = pd.DataFrame(data=X_slctd_tformed, index=range(X_slctd_tformed.shape[0]), columns=X_slctd_cols)
end = time()
for pca_n in pca_n_components:
column_names = []
for i in range(pca_n):
column_names.append("component{:2d}".format(i))
start_pca = time()
pca = PCA(n_components=pca_n).fit(X)
X_pcaed = pca.transform(X)
X_pcaed = pd.DataFrame(data=X_pcaed, index=range(X_pcaed.shape[0]), columns=column_names)
X_combined = pd.concat([X_slctd_tformed, X_pcaed], axis=1)
end_pca = time()
t_proc = (end - start) + (end_pca - start_pca)
results_precision = []
results_recall = []
results_fscore = []
kfold = cross_validation.KFold(X_combined.shape[0], n_folds=10, shuffle=False, random_state=42)
t_trains = []
t_tests = []
for train, test in kfold:
t_train_s = time()
clf_SVM.fit(X_combined.iloc[train], y_labels.iloc[train])
t_trains.append( time() - t_train_s )
t_test, y_pred = predict(clf_SVM, X_combined.iloc[test])
t_tests.append(t_test)
precision, recall, fscore, support = precision_recall_fscore_support(y_labels.iloc[test], y_pred,
average='weighted')
results_precision.append(precision)
results_recall.append(recall)
results_fscore.append(fscore)
printout = "(kbest{:3d})(pca_n{:3d})".format(kbest, pca_n)
printout += " precision: {:.2f}".format(np.mean(results_precision))
printout += " recall: {:.2f}".format(np.mean(results_recall))
printout += " fscore: {:.2f}\t".format(np.mean(results_fscore))
printout += " t_proc: {:.2f} t_train: {:.2f} t_test: {:.2f}".format(t_proc, np.mean(t_trains), np.mean(t_tests))
print printout
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
return r2_score(y_true, y_predict)
from sklearn.metrics import make_scorer
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit
def fit_model(X, y):
# Create cross-validation sets from the training data
cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)
params = {'max_depth': range(1,20)}
# Transform 'performance_metric' into a scoring function using 'make_scorer'
scoring_fnc = make_scorer(performance_metric)
#max_depth is supplied by the grid search, so it is not set here
regressor = DecisionTreeRegressor(random_state=42)
#pass cv_sets so the grid search actually uses the ShuffleSplit defined above
grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
grid = grid.fit(X, y)
return grid.best_estimator_
# clf = fit_model(X, y['Activity'])
clf = fit_model(X_train, y_train['Activity']) # fit on the label column only
print clf.score(X_train, y_train['Activity'])
print clf.score(X_test, y_test['Activity'])
from sklearn.decomposition import PCA
n_components = 2
pca = PCA(n_components=n_components).fit(X)
print pca.explained_variance_ratio_
# TODO: Transform the good data using the PCA fit above
reduced_data = pca.transform(X_train)
print X_train.shape
print reduced_data.shape
# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])
print reduced_data.shape
# Produce a scatter matrix for pca reduced data
pd.scatter_matrix(reduced_data, alpha = 0.8, figsize = (8,4), diagonal = 'kde');
for n_components in range (2, 20):
pca = PCA(n_components=n_components).fit(X)
# print pca.explained_variance_ratio_
Xr_train = pca.transform(X_train)
clf = fit_model(Xr_train, y_train['Activity'])
Xr_test = pca.transform(X_test)
print "ncomponents: {:d} train score: {:.3f} test score: {:.3f}".format(n_components, clf.score(Xr_train, y_train['Activity']), clf.score(Xr_test, y_test['Activity']))